Abstract

In recent years, many methods have been developed for detecting causal 
relationships in observational data. Some of them have the potential to 
tackle large data sets. However, these methods fail to discover a combined 
cause, i.e. a multi-factor cause consisting of two or more component variables 
which individually are not causes. A straightforward approach to uncovering 
a combined cause is to include both individual and combined variables in the 
causal discovery using existing methods, but this scheme is computationally 
infeasible due to the huge number of combined variables. In this paper, we 
propose a novel approach to address this practical causal discovery problem, 
i.e. mining combined causes in large data sets. The experiments with both 
synthetic and real world data sets show that the proposed method can obtain 
high-quality causal discoveries with a high computational efficiency. 


Keywords: Causal discovery, Combined causes, Local causal discovery,
HITON-PC, Multi-level HITON-PC


1. Introduction 

Causal relationships can reveal the causes of a phenomenon and predict
the potential consequences of an action or an event. Therefore, they are
more useful and reliable than statistical associations [3, 13].

In recent decades, causal inference has attracted considerable attention in
computer science. Causal Bayesian networks (CBNs) have become a main
framework for representing causal relationships and uncovering them in
observational data. Because CBNs cannot cope with high-dimensional data,
some efficient methods were proposed for local causal discovery around a
target variable.

Figure 1: Multiple individual causes vs. the combined cause, where solid
arrows denote causal relationships and the dashed lines represent the
interaction between the two variables.

One limitation of current causal discovery methods is that they only find
a cause consisting of a single variable. However, single causal factors are
often insufficient for reasoning about the causes of particular effects. For
example, a burning cigarette stub and inflammable material nearby can start
a fire, but neither of them alone may cause a fire. With gene regulation, it
was found that the expression level of a gene might be co-regulated by a
group of other genes, which could lead to a disease [31]. Furthermore,
a main objective of data mining is to find previously unobserved patterns
and relationships in data. Causal relationships between single variables are
relatively easy for domain experts to identify, but combined causes are much
more difficult to detect. Hence data mining methods for discovering
combined causes are in demand. In this paper, we address the problem of
finding combined causes in large data sets.

The combined causes considered in this paper are different from the
generally discussed multiple causes. For example, in Figure 1, sprinkler
causes wet ground, and so does rain. Sprinkler and rain together cause
wetter ground. However, in this work, we are concerned with the situation
where no variable alone is sufficient to cause an effect, but the
combination of the variables is. As shown in Figure 1, there is no causal
link from burning cigarette stub or inflammable material to a fire, but the
combination of these two factors leads to a fire.

The combined causes studied in this paper cannot be discovered with
CBN learning, as in a CBN an edge is drawn from A to C only when A is a
cause of C. If A and B each alone is not a cause of C, no edge is drawn from
A or B to C, and thus it is impossible to examine the combined causal effect
of A and B on C. This limitation of CBNs was discussed in [15] (page 48) as
follows:

“Suppose drugs A and B both reduce symptoms C, but the effect of
A without B is quite trivial, while the effect of B alone is not. The
directed graph representations we have considered in this chapter offer
no means to represent this interaction and to distinguish it from other
circumstances in which A and B alone each have an effect on C.”

To identify combined causes in data, one critical challenge is the
computational complexity with large data sets, as the number of combined
variables is exponential in the number of individual variables.

In this paper, we propose a multi-level approach to discovering the
combined causes of a target variable. Our method is designed based on an
efficient local causal discovery method, HITON-PC [8], which was developed on
the same theoretical ground as the well-known PC algorithm for CBN
learning.

In the rest of the paper, the related work and the contributions of this
paper are described in Section 2. Section 3 introduces the background,
including the notation and the HITON-PC algorithm. Section 4 presents the
proposed method. The experiments and results are described in Section 5.
Finally, Section 6 concludes the paper.


2. Related Work and Contributions 


As discussed in the previous section, causal Bayesian networks (CBNs),
as a mainstream causal discovery approach, have been studied extensively.
Many algorithms for CBN learning and inference [19] have been developed.
Researchers have also tried to incorporate other models and prior
knowledge into the CBN framework. Domain experts are interested
in combining prior knowledge and observational data to produce Bayesian
networks. Messaoud et al. proposed a framework to learn CBNs
by incorporating semantic background knowledge provided by the domain
ontology. In order to address the uncertainties resulting from incomplete
and partial information, Kabir et al. combined a Bayesian belief network
with a data fusion model to predict the failure rate of water mains. However,
these methods are designed to analyse individual causes, instead of combined
causes. Moreover, it may be difficult for domain experts to elicit the CBN
structure with combined causes from domain knowledge only.

Another approach was proposed to find the relationship structures
between groups of variables. Segal et al. defined the module network, in
which each node (module) is formed by a set of variables having the same
statistical behavior. They also proposed an algorithm to learn the module
assignment and the module network structure. Many algorithms and
applications [25] have been developed to extend the module network model.
Yet et al. [26] proposed a method for abstracting the BN structure, where
they also merged nodes with similar behavior to simplify the BN structure.
The modules or nodes of a module network are not the same as the combined
causes defined in this paper, since the components of a combined cause do
not necessarily have similar behaviour.

The sufficient-component cause model (often referred to by
epidemiologists) addresses the combined causes discussed in this paper.
According to the model, a disease is an inevitable consequence of a minimal
set of factors. However, no computational methods have been developed for
finding a sufficient-component cause in observational data. Although the
model and interactive causes have attracted statisticians’ attention [28],
the work is at the level of theoretical discussions.

Li et al. used the idea of retrospective cohort studies and Jin et al.
applied partial association tests to discover causal rules from
association rules. While the work has initiated the concept of the combined
causes, their focus was on integrating association rule mining with
observational studies or traditional statistical analysis for causal
discovery.

In this paper, a novel method is proposed to discover the combined causes
of the given target variable, based on the causal inference framework
established for CBN learning. The contributions of this paper are summarised
as follows:


1. We study the problem of mining combined causes which are different 
from multiple individual causes, and the problem has not been tackled 
by most existing methods. 

2. We develop a new method for discovering combined (and single) causes, 
and demonstrate its performance and efficiency by experiments with 
synthetic and real world data. 


3. Background 

In this section, we firstly describe the notation to be used in the paper
(Section 3.1). In Section 3.2 we introduce the HITON-PC algorithm, which
is the basis of our algorithms, and then discuss its time complexity.

3.1. Notation

We use upper case letters, e.g. X and Y, to represent random variables,
and multiple upper case letters, e.g. XY or X&Y, to denote the combined
variable consisting of X and Y. Bold-faced upper case letters, e.g. X and
Y, represent a set of variables. Particularly, we denote the set of predictor
variables and the target variable with V and T respectively. The conditional
independence between X and T given S is represented as I(X, T | S).

This paper deals with binary variables only, i.e. each variable has two
possible values, 1 or 0. The value of a combined (binary) variable XY is
1 if and only if each of its component (binary) variables is equal to 1 (i.e.
X = 1 and Y = 1). A multi-valued variable can be converted to a number
of binary variables, e.g. the nominal variable Education can be converted
to 3 binary variables, High School, Undergraduate and Postgraduate. With
binary variables, we can easily create and examine a combined cause involving
different values of multiple variables. For example, given the two nominal
variables, Gender and Education, after converting them to binary variables,
we can combine them to obtain variables such as (Male, High School) and
(Female, Postgraduate).
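
To make the notation concrete, the following Python sketch converts nominal
columns into binary indicator variables and forms combined variables as the
logical AND of their components. It is an illustration only, not part of the
implementation used in this paper; the function names binarize and combine
and the four toy records are assumptions made for the example.

```python
def binarize(column, categories):
    """Return one 0/1 indicator list per category of a nominal column."""
    return {c: [1 if v == c else 0 for v in column] for c in categories}


def combine(*binary_columns):
    """Combined variable: 1 only in records where every component is 1."""
    return [int(all(values)) for values in zip(*binary_columns)]


# Four hypothetical records with two nominal attributes.
gender = ["Male", "Female", "Male", "Female"]
education = ["High School", "Postgraduate", "Undergraduate", "Postgraduate"]

g = binarize(gender, ["Male", "Female"])
e = binarize(education, ["High School", "Undergraduate", "Postgraduate"])

male_high_school = combine(g["Male"], e["High School"])
female_postgraduate = combine(g["Female"], e["Postgraduate"])
print(male_high_school, female_postgraduate)
```

Because a combined variable is just another binary column, it can be passed
to the same association and independence tests as a single variable.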


3.2. HITON-PC 

Given its high efficiency and origin in the sound CBN learning theory,
HITON-PC is a commonly used method for discovering local causal
structures with a fixed target variable. The semi-interleaved HITON-PC is
used as the basis for our proposed method. Under the causal assumptions [15],
HITON-PC uses conditional independence (CI) tests to find the causal
relationships around a target variable T, i.e. the set of parents (P) and
children (C) of T.

Referring to Algorithm 1, HITON-PC takes a data set of the predictors
V and the target T to produce TPC(T), the set of parents and children of
T. The algorithm uses two data structures, a priority queue OPEN and a
list TPC(T). Initially OPEN contains all predictors associated with T and
TPC(T) is empty (see lines 1 and 2 of Algorithm 1). It then iterates between
the two phases, inclusion and elimination, until OPEN becomes empty.

In the inclusion phase, the variable having the strongest association with
T is removed from OPEN and added to TPC(T) (line 4). In the elimination
phase, if OPEN is not empty, the forward stage (lines 5-9) is executed. The
variable newly added to TPC(T), X, is eliminated from TPC(T) if it is
independent of T given a subset of current TPC(T), otherwise it is kept (still
tentatively) in TPC(T). If OPEN is empty, the backward stage (lines 10-16)
is activated: each variable X in current TPC(T) is tested, and if a subset
of TPC(T) is found such that X is independent of T given the subset, X is
removed from TPC(T).

ALGORITHM 1: The Semi-interleaved HITON-PC Algorithm
Input: A data set D for predictor variable set V and target T
Output: TPC(T), the set of parents and children of T
 1: TPC(T) ← ∅
 2: Let OPEN contain all variables associated with T, sorted in descending
    order of strength of associations.
 3: repeat
      Phase I: Inclusion (line 4)
 4:   Move the first variable X from OPEN to the end of TPC(T)
      Phase II: Elimination (lines 5-16)
      // Forward stage
 5:   if OPEN ≠ ∅ then
 6:     X ← the variable last added to TPC(T)
 7:     if ∃S ⊆ TPC(T)\{X}, s.t. I(X, T | S) then
 8:       Remove X from TPC(T)
 9:     end if
      // Backward stage
10:   else
11:     for each X ∈ TPC(T) do
12:       if ∃S ⊆ TPC(T)\{X}, s.t. I(X, T | S) then
13:         Remove X from TPC(T)
14:       end if
15:     end for
16:   end if
17: until OPEN = ∅
18: Output TPC(T)
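
The control flow of Algorithm 1 can be sketched in Python as follows. This is
a simplified illustration under stated assumptions rather than the
implementation used in our experiments: the association strengths are
supplied as a dictionary assoc that lists only the variables associated
with T, and the CI test is abstracted as a caller-supplied function
indep(x, S) that returns True when I(x, T | S) holds. Both names, and the
toy oracles in the usage lines, are hypothetical; a real implementation
would compute them from data with a statistical test.

```python
from itertools import combinations


def exists_separating_set(x, pool, indep, max_k):
    """True if some subset S of pool without x, with at most max_k members,
    makes x independent of T according to the supplied CI test."""
    others = [v for v in pool if v != x]
    for size in range(min(max_k, len(others)) + 1):
        for subset in combinations(others, size):
            if indep(x, frozenset(subset)):
                return True
    return False


def hiton_pc(assoc, indep, max_k=3):
    """Sketch of semi-interleaved HITON-PC (comments refer to Algorithm 1)."""
    tpc = []                                                 # line 1
    open_queue = sorted(assoc, key=assoc.get, reverse=True)  # line 2
    while open_queue:                                        # lines 3-17
        x = open_queue.pop(0)                                # line 4
        tpc.append(x)
        if open_queue:                                       # forward stage
            if exists_separating_set(x, tpc, indep, max_k):
                tpc.remove(x)
        else:                                                # backward stage
            for v in list(tpc):
                if exists_separating_set(v, tpc, indep, max_k):
                    tpc.remove(v)
    return tpc


# Toy oracle 1: C is associated with T only through A.
forward_result = hiton_pc(
    {"A": 0.9, "C": 0.5, "B": 0.4},
    lambda x, s: x == "C" and "A" in s,
)

# Toy oracle 2: D enters first and is only separated by A, which enters later,
# so D can only be removed in the backward stage.
backward_result = hiton_pc(
    {"D": 0.9, "A": 0.8},
    lambda x, s: x == "D" and "A" in s,
)
print(forward_result, backward_result)
```

The second call illustrates why the backward stage is needed: a variable
admitted early cannot be tested against variables that join TPC(T) later
until the final pass.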

HITON-PC uses several heuristics to improve efficiency. At the forward
stage, CI tests are conducted only on the newly added variable, instead
of performing a full variable elimination. To compensate for possible false
discoveries caused by this heuristic, HITON-PC uses the backward stage
to “tighten up” TPC(T) by testing the conditional independence of each
variable with T given the other variables in TPC(T). Moreover, the use of
the priority queue, OPEN, allows variables having stronger associations with
T to be included and evaluated first. As these variables are more likely to
be the true parents or children of the target, once they are in TPC(T), it
is expected that given these variables, other variables that should not be
in TPC(T) are quickly identified and removed so that in the forward stage
TPC(T) will not be over expanded, thus reducing the number of CI tests.
Additionally, in practice, HITON-PC restricts the maximum order of CI tests
to a given threshold max-k, i.e. the maximum size of the conditioning set
(S, see Algorithm 1) is max-k.

The time complexity of HITON-PC mainly depends on the number of CI
tests. Each variable needs to be tested on all subsets of TPC(T). Thus the
complexity regarding each variable is $O(2^{|TPC(T)|})$ and the total time
complexity is $O(|V| \, 2^{|TPC(T)|})$. When max-k is specified, the
complexity becomes polynomial, i.e. $O(|V| \, |TPC(T)|^{max\text{-}k})$.
Extensive experiments have shown that HITON-PC is able to cope with
thousands of variables with a low rate of false discoveries [8].
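
As a small worked example of the effect of max-k, the sketch below counts the
candidate conditioning sets that one variable may be tested against, with and
without the bound. The function name ci_tests_per_variable and the choice of
20 variables in TPC(T) are assumptions made for illustration.

```python
from math import comb


def ci_tests_per_variable(tpc_size, max_k=None):
    """Number of candidate conditioning sets drawn from tpc_size variables."""
    if max_k is None:
        return 2 ** tpc_size                       # every subset of TPC(T)
    return sum(comb(tpc_size, j) for j in range(min(max_k, tpc_size) + 1))


unbounded = ci_tests_per_variable(20)              # 2^20 subsets
bounded = ci_tests_per_variable(20, max_k=3)       # 1 + 20 + 190 + 1140
print(unbounded, bounded)                          # prints "1048576 1351"
```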

4. Uncovering Combined Causes 

Having introduced the background knowledge, in this section, we present
the proposed method for discovering combined causes. We firstly introduce
the naive approach (Section 4.1), which is a straightforward way to detect
the combined causes. Then we give the formal definition of combined causes
and present the basic idea of the proposed method (Section 4.2). Finally,
we describe the proposed method (including two algorithms, MH-PC-F and
MH-PC-B) for discovering combined causes (Section 4.3) and discuss their
possible false discoveries (Section 4.4).

4.1. The Naive Approach

A naive scheme for finding the combined causes can be as follows. Firstly,
we generate a new variable set with combined variables using the original
variable set. For example, for V = {A, B, C, D, E, X, Y, Z}, the new
variable set (with $2^8 - 1$ variables) is V' = {A, ..., AB, ..., YZ, ABC,
..., XYZ, ..., ABCDEXYZ}. Then we run a local causal discovery algorithm,
such as HITON-PC, to find both single and combined causes using the data set
created for V'.

The naive approach, however, is not feasible because the number of
combined variables is exponential in the number of individual variables. In
the following, we discuss how our proposed method tackles the problem.
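
The growth of the naive variable set can be checked with a few lines of
Python. The sketch enumerates V' for the eight variables of the example and
then prints the size of V' for 100 predictors; the function name
naive_variable_set is an assumption made for illustration.

```python
from itertools import combinations


def naive_variable_set(variables):
    """All non-empty combinations of the given variables, as joined names."""
    return ["".join(c)
            for size in range(1, len(variables) + 1)
            for c in combinations(variables, size)]


v = ["A", "B", "C", "D", "E", "X", "Y", "Z"]
v_prime = naive_variable_set(v)
print(len(v_prime))      # 2**8 - 1 variables, single and combined
print(2 ** 100 - 1)      # size of V' for 100 individual predictors
```

Even before any CI test is run, the data set for V' would need one column
per entry of this list, which is why the naive scheme does not scale.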

4.2. Basic Idea of the Proposed Method

In fact, it is not necessary to consider all combined variables. Particularly,
we are not interested in a combined variable, e.g. W = XY, of which a
component X or Y is a cause already, as it is reasonable to assume that the
causal relationship between W and the target T is due to the relationship
between X or Y and T. To improve efficiency, we can exclude such combined
variables when finding combined causes of T, and only consider the combined
variables whose components are not causes of T.

Furthermore, as discussed in Section 1, a combined cause consisting of
non-cause components is more difficult for domain experts to observe,
and such causes cannot be represented or discovered using other approaches
such as CBNs, hence mining such combined causes is useful in practice.

The definition of the combined causes studied in this paper is given below.


Definition 1 (Combined Cause). Let W be a combination of multiple
variables. W is a combined cause of T if W is a cause of T and none of
its component variables, X ∈ W, is a cause of T.

Based on Definition 1, we can design an algorithm to find such combined
causes (and single causes) in a level-by-level manner. We firstly, at the
$(k-1)$-th level ($k \geq 2$), obtain $TPC_{k-1}(T)$, the set of parents and
children of T, each consisting of $k-1$ individual variables. Then at the
$k$-th level, combined variables are generated based on $nTPC(T)$, the set
of non-cause variables at all the lower levels, i.e.
$nTPC(T) = nTPC_1(T) \cup \ldots \cup nTPC_{k-1}(T)$, where $nTPC_i(T)$
($i \in \{1, \ldots, k-1\}$) is the set of non-cause variables each
containing $i$ individual variables. For example, a $k$-th level combined
variable can be generated by combining an $i$-th level
($i \in \{1, \ldots, k-1\}$) non-cause variable and a $(k-i)$-th level
non-cause variable.
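
One possible reading of this level-wise generation step is sketched below in
Python. Variables are represented as frozensets of individual variable names,
and a level-k candidate is the union of an i-th level and a (k-i)-th level
non-cause variable whose components do not overlap. The function name
generate_level_k and the dictionary layout are assumptions made for this
illustration, not details fixed by the method.

```python
def generate_level_k(non_causes_by_level, k):
    """Level-k candidates built from lower-level non-cause variables.

    non_causes_by_level maps a level i to a set of frozensets, each
    frozenset holding the i individual variables of one non-cause variable.
    """
    candidates = set()
    for i in range(1, k):
        for a in non_causes_by_level.get(i, ()):
            for b in non_causes_by_level.get(k - i, ()):
                merged = a | b
                if len(merged) == k:          # components must not overlap
                    candidates.add(merged)
    return candidates


# Hypothetical example: A, B and C are single non-causes.
level1 = {frozenset("A"), frozenset("B"), frozenset("C")}
level2 = generate_level_k({1: level1}, 2)

# Suppose AB and AC turned out to be non-causes at level 2.
level3 = generate_level_k(
    {1: level1, 2: {frozenset("AB"), frozenset("AC")}}, 3)
print(sorted("".join(sorted(c)) for c in level2))
print(sorted("".join(sorted(c)) for c in level3))
```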

For non-cause variables X and Y, if the combination XY is a combined
cause, then it is reasonable to assume that X positively contributes to the
relationship between Y and the target T and vice versa. For example,
combustible dust suspended in the air (even at a high concentration) has no
causal effect on a dust explosion, but an ignition source will strengthen
their relationship significantly and thus the combination of the two factors
can result in a dust explosion. This observation leads to the following
definition.

Definition 2 (Redundant Combined Variable). For ∀X, Y ∈ V, the
combination XY is a redundant combined variable, if either I(X, T | Y = 1)
or I(Y, T | X = 1), where X and Y are not individual causes of the target T.

By excluding redundant combined variables, we can further improve the 
efficiency of the causal discovery. 
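
The redundancy check of Definition 2 can be sketched in Python as below. The
sketch assumes Pearson's chi-square test on the 2x2 table of X and T within
the records where Y = 1 (and symmetrically for Y within X = 1); the choice of
test, the function names and the toy data are assumptions made for
illustration. The default threshold of 0.01 follows the p-value used in
Section 5 for pruning redundant combined variables.

```python
from math import erfc, sqrt


def chi2_pvalue_2x2(xs, ts):
    """p-value of Pearson's chi-square independence test for two 0/1 lists."""
    n = len(xs)
    counts = [[0, 0], [0, 0]]
    for x, t in zip(xs, ts):
        counts[x][t] += 1
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    if n == 0 or 0 in row or 0 in col:
        return 1.0                    # degenerate table: no evidence of dependence
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (counts[i][j] - expected) ** 2 / expected
    return erfc(sqrt(chi2 / 2))       # chi-square survival function, 1 d.o.f.


def independent_given_one(xs, ts, ys, alpha=0.01):
    """I(X, T | Y = 1): test X against T on the records where Y equals 1."""
    stratum = [(x, t) for x, t, y in zip(xs, ts, ys) if y == 1]
    p = chi2_pvalue_2x2([x for x, _ in stratum], [t for _, t in stratum])
    return p > alpha


def is_redundant(xs, ys, ts, alpha=0.01):
    """Definition 2: XY is redundant if either conditional independence holds."""
    return (independent_given_one(xs, ts, ys, alpha)
            or independent_given_one(ys, ts, xs, alpha))


# Toy data 1: T = X AND Y, so each component matters once the other is 1.
xs = [0, 0, 1, 1] * 50
ys = [0, 1, 0, 1] * 50
ts_and = [x & y for x, y in zip(xs, ys)]

# Toy data 2: T is balanced within every (X, Y) cell, so X and Y are irrelevant.
xs2 = [0, 0, 0, 0, 1, 1, 1, 1] * 25
ys2 = [0, 0, 1, 1, 0, 0, 1, 1] * 25
ts2 = [0, 1, 0, 1, 0, 1, 0, 1] * 25

print(is_redundant(xs, ys, ts_and), is_redundant(xs2, ys2, ts2))
```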

Based on the above discussion and HITON-PC, we propose the Multi-level
HITON-PC (MH-PC) method for finding both single and combined
causes of a given target. In the following section, we present the details of
the method.

4.3. Multi-level HITON-PC 

Referring to Algorithm 2, at the first level, MH-PC invokes HITON-PC
to find the single causes of T, $TPC_1(T)$ (line 1) and initiates TPC(T) as
$TPC_1(T)$ (line 2). The single non-cause variables are put in $nTPC_1(T)$ and
nTPC(T) (non-cause variables identified at all the lower levels) is initially
empty (line 2).

At level k ($k \geq 2$), MH-PC firstly updates nTPC(T) so that it contains
the non-causes from levels 1 to $k-1$ (line 4). Next the algorithm generates
combined variables containing k individual variables by combining the
variables in nTPC(T) (line 5). Redundant combined variables are then removed
(line 6) and the new data set $D_k$ for level k combined variables ($V_k$) is
created too (line 7). From lines 8 to 23, we identify level k combined causes
from $V_k$. Initially OPEN contains all the combined variables in $V_k$ which
are associated with T and the variables are sorted in descending order of the
strength of associations (line 8). Similar to HITON-PC, the inclusion and
elimination phases are carried out iteratively till OPEN is empty. At the end
of the iteration (line 23), $TPC_k(T)$ includes the discovered combined causes
consisting of k variables, and TPC(T) includes all the causes from level 1 to
level k. Note that in line 17, to improve the efficiency further, the backward
stage only checks the level k candidates in $TPC_k(T)$, instead of all
candidates in TPC(T), as all the lower level parents and children have been
confirmed at previous levels.

In line 24, the set of level k non-causes is updated before completing
the work at level k. Finally, once k reaches $k_{max}$ (the maximum level of
causal discovery), MH-PC outputs TPC(T).

ALGORITHM 2: Multi-level HITON-PC (MH-PC)
Input: A data set D for predictor variable set V and target T; k_max,
       the maximum level of causal discovery
Output: TPC(T), the set of (single and combined) parents and
        children of T
 1: Call Algorithm 1 (HITON-PC), i.e. TPC_1(T) = HITON-PC(D, V, T)
 2: TPC(T) = TPC_1(T); nTPC_1(T) = V\TPC_1(T); nTPC(T) = ∅
 3: for k = 2 to k_max do
 4:   nTPC(T) = nTPC(T) ∪ nTPC_{k-1}(T)
 5:   Generate k-th level combined variable set V' based on nTPC(T)
 6:   V_k = redundancyTest(V')
 7:   Generate corresponding data set D_k for V_k
 8:   Let OPEN contain all variables (in V_k) associated with T, sorted
      in descending order of strength of associations.
 9:   repeat
        Phase I: Inclusion (line 10)
10:     Move the first variable from OPEN, add it to the end of TPC(T)
        and TPC_k(T)
        Phase II: Elimination (lines 11-22)
        // Forward stage
11:     if OPEN ≠ ∅ then
12:       X ← the variable last added to TPC(T)
13:       if ∃S ⊆ TPC(T), s.t. I(X, T | S) then
14:         Remove X from TPC(T) and TPC_k(T)
15:       end if
        // Backward stage
16:     else
17:       for each X ∈ TPC_k(T) do
18:         if ∃S ⊆ TPC(T)\{X}, s.t. I(X, T | S) then
19:           Remove X from TPC(T) and TPC_k(T)
20:         end if
21:       end for
22:     end if
23:   until OPEN = ∅
24:   nTPC_k(T) = V_k\TPC_k(T)
25: end for
26: Output TPC(T)

In the forward stage (i.e. the case when OPEN ≠ ∅), as with HITON-PC,
MH-PC searches the current TPC(T) for a subset S to test whether
the combined variable X is independent of T given S (line 13). Since
combined variables in TPC(T) are combinations of individual variables, MH-PC
may have conducted some redundant conditional independence tests. For
example, the CI test between X and T given a combined variable YZ (i.e.
I(X, T | YZ)) may be unnecessary if the test given the two individual
variables Y and Z (i.e. I(X, T | Y, Z)) has been done. To address this issue,
we propose a variant of MH-PC, called MH-PC-B (B version of MH-PC).
To avoid confusion, in the rest of the paper, we call the MH-PC algorithm
shown in Algorithm 2 MH-PC-F (Full version of MH-PC). In the forward
stage, when conducting the level k test with MH-PC-B, we do not include
the level k variables in $TPC_k(T)$ into conditioning sets, i.e. we replace
line 13 in Algorithm 2 with the following statement:

    if ∃S ⊆ TPC(T)\TPC_k(T), s.t. I(X, T | S)

As MH-PC-B conducts the tests conditioning only on the lower level
variables, it can have higher efficiency than MH-PC-F, but at the same time
it may produce some false positives. However, since in the backward stage
(lines 17-21 of Algorithm 2), we do another check of the candidate causes
remaining in $TPC_k(T)$, it is expected that the false discoveries are removed.
As we will see from the next section, the experiments show that MH-PC-B is
more efficient than MH-PC-F, while producing the same results as MH-PC-F
with the data sets used.
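
The difference between the two variants in the forward stage can be
illustrated by counting candidate conditioning sets. In the sketch below, the
function names, the variable names and the example contents of TPC(T) are
hypothetical; excluding X itself from the pool in the F variant is our
assumption, mirroring line 7 of Algorithm 1.

```python
from itertools import combinations


def forward_pool(tpc, tpc_k, x, variant):
    """Variables that may appear in conditioning sets when testing x."""
    if variant == "F":                 # MH-PC-F: all of TPC(T) except x
        return [v for v in tpc if v != x]
    if variant == "B":                 # MH-PC-B: lower-level variables only
        return [v for v in tpc if v not in tpc_k]
    raise ValueError(variant)


def conditioning_sets(pool, max_k):
    """All subsets of pool with at most max_k members."""
    return [frozenset(s)
            for size in range(min(max_k, len(pool)) + 1)
            for s in combinations(pool, size)]


# Hypothetical state at level 2: A and B are single causes, CD and EF are
# level-2 candidates already kept, and GH has just been added.
tpc = ["A", "B", "CD", "EF", "GH"]
tpc_k = ["CD", "EF", "GH"]

sets_f = conditioning_sets(forward_pool(tpc, tpc_k, "GH", "F"), max_k=3)
sets_b = conditioning_sets(forward_pool(tpc, tpc_k, "GH", "B"), max_k=3)
print(len(sets_f), len(sets_b))        # prints "15 4"
```

The gap widens as more level k candidates accumulate in $TPC_k(T)$, which is
consistent with the efficiency argument given above for MH-PC-B.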

4.4. False Discoveries of Multi-level HITON-PC

Since MH-PC-F (and MH-PC-B) follows the idea of HITON-PC, we firstly
analyse the quality of HITON-PC in terms of false discoveries. In HITON-PC,
possible false decisions mainly come from two sources: the use of max-k,
the maximum size of conditioning sets (S, see Algorithms 1 and 2) used
for conditional independence tests, and incorrect results of statistical tests.
Using a smaller max-k reduces the number of conditional independence tests,
and thus improves efficiency, but may result in false positive discoveries.
Fortunately, when max-k = 3 or 4, the false positive rate is not high, as
shown in [8]. When we do not have enough samples, the statistical tests may
produce incorrect results.

In the following, we will discuss the false discoveries coming from the
interactions between variables. As mentioned above, the proposed method
only focuses on non-redundant combined variables (Definition 2). While this
strategy is used for reducing complexity, it may lead to false discoveries.
However, we argue that our algorithms can still obtain high-quality causal
findings. For non-cause variable X, if another non-cause variable Y cannot
improve the relationship between X and the target T, then the combination
XY, in most cases, may not improve the relationship between X and T.
This type of combined variable is unlikely to be a combined cause. The
experiment results in Section 5 have confirmed this intuition.


5. Experiments 


We implemented MH-PC-F and MH-PC-B based on the semi-interleaved
HITON-PC implementation in the R package bnlearn. In the experiments,
the maximum level of combination (i.e. $k_{max}$ in Algorithm 2) is
restricted to 2, i.e. a combined cause consists of at most two component
variables. We set the threshold of p-value to 0.01 to prune redundant
combined variables and 0.05 to test causal relationships, for both synthetic
and real world data sets.


5.1. Data Sets 

10 synthetic and 7 real world data sets were used in the experiments, and
a summary of the data sets is shown in Table 1. The variables in all data
sets are binary, i.e. each variable has two possible values, 1 or 0. The class
variable in each data set is specified as the target variable. The numbers
of variables shown in the table refer to the numbers of single predictor
variables. The distribution of each data set indicates the percentages of the
two different values of the class variable. For synthetic data sets, the
ground truth (i.e. the number of true causes) is shown in the table, where
the first value is the number of single causes each consisting of one
predictor variable and the second value is the number of combined causes each
consisting of two predictor variables.

Table 1: A Brief Description of Data Sets

Name            Records  #Variables  Distributions   Ground Truth
Syn-7              1000           6  39.1% & 60.9%   2, 1
Syn-10             1000           9  42.2% & 57.8%   3, 2
Syn-12             2000          11  72.1% & 27.9%   4, 3
Syn-16             2000          15  45.2% & 54.8%   4, 3
Syn-20             2000          19  55.6% & 44.4%   4, 4
Syn-50             5000          49  30.8% & 69.2%   5, 5
Syn-60             5000          59  30.8% & 69.2%   5, 5
Syn-80             5000          79  30.5% & 69.5%   5, 5
Syn-100            5000          99  30.0% & 70.0%   5, 5
Syn-120            5000         119  30.4% & 69.6%   5, 5
CMC                1473          22  57.3% & 42.7%   -
German             1000          60  30.0% & 70.0%   -
House-votes-84      435          16  61.4% & 38.6%   -
Hypothyroid        3163          51  4.8% & 95.2%    -
Kr-vs-kp           3196          74  47.8% & 52.2%   -
Sick               2800          58  6.1% & 93.9%    -
Census           299285         495  6.2% & 93.8%    -

The first five synthetic data sets (with small numbers of variables) in
Table 1 were generated with two main steps: (1) generating a data set based
on a BN (Bayesian network) created randomly by the TETRAD software tool
(http://www.phil.cmu.edu/tetrad/), and (2) generating the final synthetic
data set by “splitting” some causes of the target into two new variables.
Specifically, we firstly created a random BN using the TETRAD software,
whose structure and conditional probability tables were both generated
randomly. In the obtained BN, one of the variables was designated as the
target and the others as predictor variables. Records of all variables were
generated based on the conditional probability tables, using the built-in
Bayes Instantiated Model. Then we selected and split a parent node, e.g. A,
of the target into two variables, e.g. $A_1$ and $A_2$, such that
(1) $A_1 \wedge A_2 = A$ (i.e. $A_1$ and $A_2$ both are equal to 1 if and
only if A is 1), and (2) neither $A_1$ nor $A_2$ is an individual cause of
the target. Note that, for combined causes in the synthetic data, we do not
have a complete ground truth, since the data may include some combined
causes that we do not observe.
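
The splitting step can be illustrated with a short Python sketch. The sampling
scheme used here (choosing uniformly among the three value pairs whose
conjunction is 0 whenever A = 0) is an assumption made for illustration, as
are the function name split_cause and the seed parameter. The sketch only
enforces condition (1); condition (2) would still need to be verified on the
generated data.

```python
import random


def split_cause(a_values, seed=0):
    """Split binary variable A into A1 and A2 such that A1 AND A2 equals A."""
    rng = random.Random(seed)
    a1, a2 = [], []
    for a in a_values:
        if a == 1:
            x1, x2 = 1, 1              # both parts must be 1 when A is 1
        else:
            # any pair whose conjunction is 0 (assumed uniform choice)
            x1, x2 = rng.choice([(0, 0), (0, 1), (1, 0)])
        a1.append(x1)
        a2.append(x2)
    return a1, a2


a = [1, 0, 0, 1, 0, 1, 0, 0]
a1, a2 = split_cause(a)
recombined = [x & y for x, y in zip(a1, a2)]
print(recombined == a)                 # prints "True"
```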

For the next five larger synthetic data sets (Syn-50, ..., Syn-120), it is
impractical to generate them based on randomly drawn BNs, since it takes
too long to generate one. We firstly drew a simple BN where some
variables were the parents of the target and some were not. Then we adopted
logistic regression to generate the data based on the BN. Next, we employed
the aforementioned splitting process to obtain the final data sets.

All real world data sets shown in Table 1 are obtained from the UCI
Machine Learning Repository [38]. The first six real world data sets were
employed to assess the effectiveness of the proposed algorithms, while the
Census data set was used for evaluating the efficiency. The CMC
(Contraceptive Method Choice) data set is an extraction of the National
Indonesia Contraceptive Prevalence Survey in 1987. The German data set is a
data set for classifying people’s credit risks based on a set of attributes.
House-votes-84 contains the United States Congressional Voting Records in
1984. Hypothyroid and Sick are two medical data sets, which are from the
Thyroid Disease data set of the repository (discretised using the MLC++
discretisation utility). The Kr-vs-kp data set is generated and described
based on a chess game, King-Rook versus King-Pawn on A7 (usually abbreviated
as KRKPA7). The Census data set is the Census Income (KDD) data set from
the UCI Machine Learning Repository. In our experiments, all continuous
attributes have been removed from the original data sets.

5.2. Performance Evaluation 

Three sets of experiments with the synthetic data were done to assess the 
accuracy of MH-PC-F and MH-PC-B by examining the results against the 
ground truth. 

Firstly we compared MH-PC-F and MH-PC-B with two naive approaches 
using HITON-PC and PC-select respectively (denoted as Naive-H and 
Naive-S in the following). PC-select is an effective method for discovering the 
parents and children of a target variable, so we employed it as a benchmark 
for accuracy comparison. 

Because the two naive methods, especially Naive-S, cannot handle large 
data sets, two small synthetic data sets (Syn-7 and Syn-10) were used in this 
set of experiments. Moreover, it is easier for small data sets to provide a 
good visualization of the detailed results. 

The ground truth of the Syn-7 data set is that V3 and V4 are two single
causes of the target and V1&V2 is a combined cause (see the Ground truth
column of Table 2, where Yes means the predictor variable is a cause of
the target, and No means otherwise). In Table 2, MH-PC-F and MH-PC-B
find exactly the ground truth in Syn-7. While Naive-H identifies the ground
truth, it includes a number of redundant results, for example, V3&V5 and
V4&V5, since V3 and V4 are causes already. Naive-S misses the combined
cause (V1&V2) and it finds some redundant combined causes too. Similar
results can be observed with the Syn-10 data set. MH-PC-F and MH-PC-B
miss the true single cause, V7, and the naive methods do not find it either.

Table 2: Comparison of the proposed algorithms with the naive methods

Data set  Predictor   Ground  Naive-H  Naive-S  MH-PC-F  MH-PC-B
          variables   truth
Syn-7     V3          Yes     ✓        ✓        ✓        ✓
          V4          Yes     ✓        ✓        ✓        ✓
          V1&V2       Yes     ✓                 ✓        ✓
          V3&V5       No      ✓        ✓
          V4&V5       No      ✓        ✓
Syn-10    V5          Yes     ✓        ✓        ✓        ✓
          V6          Yes     ✓                 ✓        ✓
          V7          Yes
          V1&V2       Yes     ✓        ✓        ✓        ✓
          V3&V4       Yes                       ✓        ✓
          V1&V3       No
          V4&V5       No
          V5&V6       No               ✓
          V5&V9       No

Then we compared MH-PC-F and MH-PC-B with CR-CS and CR-PA [32] 
using three synthetic data sets, Syn-12, Syn-16 and Syn-20. CR-CS and 
CR-PA are both designed to explore causal relationships from association 
rules, and they are also capable of finding both single and combined 
causes. The results are shown in Table 3, where P, R and F1 represent the 
Precision, Recall and F1-measure respectively. In this paper, we used an 
odds ratio greater than 1.5 as the threshold to indicate a significant result 
in both CR-CS and CR-PA. We can see that MH-PC-F and MH-PC-B both 
achieve higher Precision and F1 than CR-CS and CR-PA on all three data 
sets, based on the known ground truth. CR-CS and CR-PA both perform 
very well in terms of Recall, but they also include many false positives, 
since these two methods are mainly intended for exploration: they tolerate 
false positives and seek high recall. 
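
For reference, the three measures can be computed from a set of discovered causes and a set of ground-truth causes as in the following minimal sketch. The function name and the example sets are hypothetical; the counts are chosen only so that the resulting values have the same pattern as a row with P = 0.25 and R = 1.00.

```python
def precision_recall_f1(discovered, ground_truth):
    """Return (precision, recall, F1) for two collections of causes."""
    discovered, ground_truth = set(discovered), set(ground_truth)
    true_positives = len(discovered & ground_truth)
    precision = true_positives / len(discovered) if discovered else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 2 true causes, 8 reported causes, both true ones included.
truth = {"V1&V2", "V3&V4"}
found = truth | {f"X{i}" for i in range(6)}

p, r, f1 = precision_recall_f1(found, truth)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.25 1.0 0.4
```

The example shows why a method that reports many candidates can reach a Recall of 1.00 while its F1 stays low.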

In the next set of experiments, the last five larger synthetic data sets 
in Table 1 were used. From Table 4, all four algorithms (i.e. CR-CS, 
CR-PA, MH-PC-F and MH-PC-B) recover the ground truth well from the 
data sets with relatively large sizes, with a Recall of 1.00 and a Precision 
of at least 0.71 on every data set. 

Table 3: Comparison of combined causes discovered by CR-CS, CR-PA, 
MH-PC-F and MH-PC-B with small synthetic data sets 

Algorithm  Measure  Syn-12  Syn-16  Syn-20 
CR-CS      P        0.25    0.23    0.36 
           R        1.00    1.00    1.00 
           F1       0.40    0.38    0.53 
CR-PA      P        0.16    0.17    0.27 
           R        1.00    1.00    1.00 
           F1       0.27    0.29    0.42 
MH-PC-F    P        0.67    0.50    1.00 
           R        0.67    1.00    1.00 
           F1       0.67    0.67    1.00 
MH-PC-B    P        0.67    0.50    1.00 
           R        0.67    1.00    1.00 
           F1       0.67    0.67    1.00 


Table 4: Comparison of combined causes discovered by CR-CS, CR-PA, 
MH-PC-F and MH-PC-B with larger synthetic data sets 

Algorithm  Measure  Syn-50  Syn-60  Syn-80  Syn-100  Syn-120 
CR-CS      P        0.71    1.00    0.71    0.83     1.00 
           R        1.00    1.00    1.00    1.00     1.00 
           F1       0.83    1.00    0.83    0.91     1.00 
CR-PA      P        0.71    1.00    0.83    0.83     1.00 
           R        1.00    1.00    1.00    1.00     1.00 
           F1       0.83    1.00    0.91    0.91     1.00 
MH-PC-F    P        0.83    0.71    1.00    0.83     0.83 
           R        1.00    1.00    1.00    1.00     1.00 
           F1       0.91    0.83    1.00    0.91     0.91 
MH-PC-B    P        0.83    0.71    1.00    0.83     0.83 
           R        1.00    1.00    1.00    1.00     1.00 
           F1       0.91    0.83    1.00    0.91     0.91 


Based on the results of the three sets of experiments, it is reasonable to 
conclude that MH-PC-F and MH-PC-B are capable of finding single and 
combined causes. Another finding is that the causes (single and combined) 
identified by MH-PC-F and MH-PC-B are always the same, which indicates 
that the two algorithms achieve consistent results. This is also demonstrated 
by the results of the two algorithms on all the real world data sets, as 
described in the following. 

Table 5: Number of (single and combined) causes discovered by MH-PC-F 
and MH-PC-B in real world data sets 

Data set        No. of single causes  No. of combined causes 
CMC             2                     0 
German          1                     13 
House-votes-84  0                     18 
Hypothyroid     3                     4 
Kr-vs-kp        7                     12 
Sick            3                     8 

Table 6: Examples of combined causes identified from Sick and German data 
sets 

Data set  Combined cause 
Sick      T3 < 1.151 & TT4 < 87.5 
          sick = true & T3 < 1.151 
German    Checking.account = no-account & Savings.account < 100DM 
          Property = real.estate & Other.installment.plans = none 

To investigate combined causes in real world cases, we ran the proposed 
algorithms on the first six real world data sets in Table 1 for performance 
evaluation. MH-PC-F and MH-PC-B again return consistent results, as 
shown in Table 5. The proposed algorithms find many combined causes, and 
some of the combined causes discovered are reasonable as judged by common 
sense, as shown in Table 6. For example, from the Sick data set it is found 
that low levels of both TT4 (Total T4) and T3 may result in thyroid disease 
(Table 6, where T4 and T3 are hormones produced by the thyroid), and that 
being sick and having a low level of T3 can lead to thyroid disease too. 
Some interesting combined causes are also discovered in the German data 
set. If a person owns private real estate and does not have any other 
installment plan, then this person is very likely to have a low default risk. 
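
To make the notion of a combined variable concrete, the following minimal sketch builds a binary combined variable as the conjunction of two conditions, using the thresholds of the first Sick example in Table 6. The records, the helper name `combine` and the dictionary representation are hypothetical and serve only as an illustration; they are not taken from the data set or from the proposed algorithms.

```python
def combine(records, conditions):
    """Return a 0/1 list: 1 where every condition holds for the record."""
    return [int(all(cond(rec) for cond in conditions)) for rec in records]


# Hypothetical records; only the thresholds come from Table 6.
records = [
    {"T3": 0.9, "TT4": 80.0},   # both values low
    {"T3": 0.9, "TT4": 120.0},  # only T3 low
    {"T3": 2.0, "TT4": 80.0},   # only TT4 low
]


def low_t3(rec):
    return rec["T3"] < 1.151


def low_tt4(rec):
    return rec["TT4"] < 87.5


combined = combine(records, [low_t3, low_tt4])
print(combined)  # [1, 0, 0]
```

The combined variable takes the value 1 only when both component conditions hold, so it can be associated with the target even when neither component is on its own.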

5.3. Efficiency and Scalability 

We ran Naive-H, Naive-S, CR-CS, CR-PA, MH-PC-F and MH-PC-B with 
various data sets on the same computer with a 3.4 GHz quad-core CPU and 
16 GB of memory. 

The running time of the algorithms on subsets of the Census data 
containing 30, 50, 70, 100 and 150 variables with the same sample size (50K) 
is shown in Figure 2. The two naive methods are much slower than MH-PC-F 
and MH-PC-B, and Naive-S is the least efficient one. While the two naive 
methods do not scale well with the number of variables, the two proposed 
algorithms both scale well. 

Figure 2: Scalability with number of variables - Census data 

Figure 3: Scalability with number of variables - Synthetic data 

Figure 4: Scalability with number of records - Census data 

When the algorithms are applied to the synthetic data sets containing 
different numbers of variables, neither naive method returns results within 
5 hours, so no results of the naive methods are shown in Figure 3. From the 
figure, both proposed algorithms again scale well. 

We then ran the algorithms with 50K, 100K, 150K, 200K and 250K 
samples respectively from the Census data set with 100 variables selected 
randomly, and the execution time of MH-PC-F and MH-PC-B is shown in 
Figure 4. No results are obtained for Naive-S, and Naive-H also cannot 
handle data sets with more than 50K samples. Similarly, MH-PC-B is more 
efficient and scalable than MH-PC-F. 

To summarise, MH-PC-F and MH-PC-B are much faster than the naive 
methods, and both proposed algorithms scale well in terms of the number of 
variables and the number of samples. The experiments have also confirmed 
the discussion in Section 4.3 that MH-PC-B can achieve higher efficiency 
than MH-PC-F. 

6. Conclusion 


In practice, it is useful to identify a cause consisting of multiple variables 
which individually are not causes of the target variable. However, finding 
such combined causes is challenging, as the number of combined variables 
increases exponentially with the number of individual variables. 
As far as we know, there has been very little work on discovering combined 
causes, and the problem has not been studied in causal Bayesian network 
research either. 
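
The growth in the number of combined variables can be illustrated by counting the subsets of two or more variables among n individual variables. The sketch below is a simplification made for illustration: it counts variable subsets only (the function name `num_combined` is hypothetical), so with multi-valued variables the number of value combinations would be larger still.

```python
from math import comb


def num_combined(n_vars, max_size=None):
    """Number of variable subsets of size 2..max_size (default: all sizes)."""
    if max_size is None:
        max_size = n_vars
    return sum(comb(n_vars, k) for k in range(2, max_size + 1))


# Variable counts taken from the Census scalability experiment.
for n in (30, 50, 70, 100, 150):
    # Pairs only versus combinations of any size.
    print(n, num_combined(n, max_size=2), num_combined(n))
```

For n variables, the total over all sizes equals 2^n - n - 1, which is why enumerating every combined variable quickly becomes infeasible.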

In this paper, we have proposed two efficient algorithms to mine 
combined causes from large data sets. The proposed algorithms are based on 
a well-designed local causal discovery method, the semi-interleaved HITON- 
PC algorithm, with novel extensions for dealing with combined causes. 
Experiments have shown that the proposed algorithms can find single and 
combined causes with a low number of false discoveries from synthetic data 
sets, and discover many reasonable combined causes from real world data. 
Additionally, the algorithms have been shown to scale well with respect 
to the number of variables and the number of samples on both synthetic 
and real world data. 

In the near future, we will apply the proposed algorithms to real 
world problems, such as investigating the mechanisms of gene regulation, for 
which there is evidence showing that many gene regulators work together to 
regulate their target genes. 